Using both dataprep.eda and matplotlib to create basic plots of our fields.
After importing pandas, add this import line:
import matplotlib.pyplot as plt
First install dataprep using the command prompt or terminal
(Remember, if you are using an environment to install it within your environment.)
pip install dataprep
Then add this import line:
from dataprep.eda import plot, plot_correlation
For creating plots for exploratory data analysis, see the documentation here:
To learn more about the DataPrep.eda library:
import pandas as pd
import matplotlib.pyplot as plt
from dataprep.eda import plot, plot_correlation
Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. NumExpr defaulting to 8 threads.
df = pd.read_pickle('output/movies_BUDG_REV.pkl.xz')
df
| title | release_date | budget | revenue | runtime | genre1 | genres | |
|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 193.0 | Drama | Drama, History, War |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | 197.0 | Drama | Drama, History |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 68.0 | Comedy | Comedy, Drama |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | 143.0 | Drama | Drama |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 149.0 | Action | Romance, Adventure, Fantasy |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 142.0 | Action | Action, Adventure, Science Fiction |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | 110.0 | Fantasy | Comedy, Drama, Fantasy |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | NaN | Drama | Drama |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 135.0 | Action | Drama, Thriller, Mystery, Action |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 111.0 | Comedy | Comedy |
5815 rows × 7 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5815 entries, 14 to 26121 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 5815 non-null object 1 release_date 5815 non-null datetime64[ns] 2 budget 5815 non-null int64 3 revenue 5815 non-null int64 4 runtime 5810 non-null float64 5 genre1 5814 non-null category 6 genres 5814 non-null object dtypes: category(1), datetime64[ns](1), float64(1), int64(2), object(2) memory usage: 324.1+ KB
# Use dataprep.eda to plot all fields
plot(df)
| Number of Variables | 7 |
|---|---|
| Number of Rows | 5815 |
| Missing Cells | 7 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 0 |
| Duplicate Rows (%) | 0.0% |
| Total Size in Memory | 1.1 MB |
| Average Row Size in Memory | 191.5 B |
| Variable Types |
|
| budget is skewed | Skewed |
|---|---|
| revenue is skewed | Skewed |
| runtime is skewed | Skewed |
| title has a high cardinality: 5690 distinct values | High Cardinality |
| genres has a high cardinality: 1430 distinct values | High Cardinality |
release_date¶# dataprep.eda count of movies by date
plot(df, 'release_date')
| Distinct Count | 4115 |
|---|---|
| Unique (%) | 70.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 90.9 KB |
| Minimum | 1915-02-08 00:00:00 |
| Maximum | 2019-12-26 00:00:00 |
# dataprep AVG Budget by date
plot(df, 'release_date','budget')
# Avg Revenue by date
plot(df, 'release_date','revenue')
budget¶# Summary stats
df['budget'].describe().map('{:,.0f}'.format)
count 5,815 mean 32,164,621 std 41,114,852 min 1 25% 6,000,000 50% 18,000,000 75% 40,000,000 max 380,000,000 Name: budget, dtype: object
# dataprep.eda stats and charts
plot(df, 'budget')
| Distinct Count | 766 |
|---|---|
| Unique (%) | 13.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 90.9 KB |
| Mean | 3.2165e+07 |
| Minimum | 1 |
| Maximum | 3.8e+08 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 1 |
|---|---|
| 5-th Percentile | 851757 |
| Q1 | 6e+06 |
| Median | 1.8e+07 |
| Q3 | 4e+07 |
| 95-th Percentile | 1.2e+08 |
| Maximum | 3.8e+08 |
| Range | 3.8e+08 |
| IQR | 3.4e+07 |
| Mean | 3.2165e+07 |
|---|---|
| Standard Deviation | 4.1115e+07 |
| Variance | 1.6904e+15 |
| Sum | 1.8704e+11 |
| Skewness | 2.5618 |
| Kurtosis | 8.3835 |
| Coefficient of Variation | 1.2783 |
# Using matplotlib -- adding semicolon for cleaner chart
df['budget'].plot.hist();
Evaluation: Skewed to the right by a small number of very high budgets.
# matplotlib boxplot
df['budget'].plot.box();
# matplotlib horizontal boxplot resized
df['budget'].plot.box(vert=False, figsize=(12,5));
We have a lot of outliers with extraordinarily high budgets ...
Let's do the math to see what number defines the upper whisker, beyond which movie budgets are considered outliers.
# Get the statistical summary
df['budget'].describe().map('{:,.0f}'.format)
count 5,815 mean 32,164,621 std 41,114,852 min 1 25% 6,000,000 50% 18,000,000 75% 40,000,000 max 380,000,000 Name: budget, dtype: object
Outliers are those whose values are greater than 1.5 IQR above the 75th percentile.
Outliers are those with budget of $91M or greater
revenue¶# Using dataprep.eda
plot(df, 'revenue')
| Distinct Count | 5456 |
|---|---|
| Unique (%) | 93.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 90.9 KB |
| Mean | 9.5662e+07 |
| Minimum | 1 |
| Maximum | 2.7978e+09 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 1 |
|---|---|
| 5-th Percentile | 326434 |
| Q1 | 8.5999e+06 |
| Median | 3.2204e+07 |
| Q3 | 1.0233e+08 |
| 95-th Percentile | 3.9067e+08 |
| Maximum | 2.7978e+09 |
| Range | 2.7978e+09 |
| IQR | 9.3733e+07 |
| Mean | 9.5662e+07 |
|---|---|
| Standard Deviation | 1.7989e+08 |
| Variance | 3.2361e+16 |
| Sum | 5.5627e+11 |
| Skewness | 4.7939 |
| Kurtosis | 36.7958 |
| Coefficient of Variation | 1.8805 |
# Using matplotlib -- adding semicolon for cleaner chart
df['revenue'].plot.hist();
# matplotlib horizontal boxplot resized
df['revenue'].plot.box(vert=False, figsize=(12,5));
# Get the statistical summary
df['revenue'].describe().map('{:,.0f}'.format)
count 5,815 mean 95,661,997 std 179,892,596 min 1 25% 8,599,919 50% 32,204,030 75% 102,332,635 max 2,797,800,564 Name: revenue, dtype: object
genre¶# dataprep overview
plot(df,'genre1')
| Distinct Count | 12 |
|---|---|
| Unique (%) | 0.2% |
| Missing | 1 |
| Missing (%) | 0.0% |
| Memory Size | 52.2 KB |
| Mean | 8.5157 |
|---|---|
| Standard Deviation | 5.5013 |
| Median | 6 |
| Minimum | 5 |
| Maximum | 22 |
| 1st row | Drama |
|---|---|
| 2nd row | Drama |
| 3rd row | Comedy |
| 4th row | Drama |
| 5th row | Action |
| Count | 47838 |
|---|---|
| Lowercase Letter | 40352 |
| Space Separator | 140 |
| Uppercase Letter | 7486 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
# Matplotlib to get a horizontal bar chart, including null values
df['genre1'].value_counts(dropna=False).plot.barh(title = 'Genres Count', x='title', figsize=(7,5)).invert_yaxis()
# Matplotlib to get a horizontal bar chart, including null values
# Multi-line for easier reading
(
df['genre1'].value_counts(dropna=False) # Get a count per category, including null values
.plot.barh(title = 'Genres Count', x='title', figsize=(7,5))
.invert_yaxis() # Fix sort order for horz bar chart
)
# matplotlib scatterplot
df.plot.scatter(title='Budget & Revenue', x='revenue', y='budget');
# dataprep version
# see plot_correlation doc page: https://docs.dataprep.ai/user_guide/eda/plot_correlation.html
plot_correlation(df, 'budget', 'revenue')
# dataprep
plot_correlation(df, 'runtime', 'budget')
plot_correlation(df, 'runtime','revenue')
and organize for more efficient analysis
profit and ratio columnsgenre1 to genregenres columntitle as the index# View current dataframe
df
| title | release_date | budget | revenue | runtime | genre1 | genres | |
|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 193.0 | Drama | Drama, History, War |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | 197.0 | Drama | Drama, History |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 68.0 | Comedy | Comedy, Drama |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | 143.0 | Drama | Drama |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 149.0 | Action | Romance, Adventure, Fantasy |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 142.0 | Action | Action, Adventure, Science Fiction |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | 110.0 | Fantasy | Comedy, Drama, Fantasy |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | NaN | Drama | Drama |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 135.0 | Action | Drama, Thriller, Mystery, Action |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 111.0 | Comedy | Comedy |
5815 rows × 7 columns
profit Column¶df['profit'] = df['revenue'] - df['budget']
df
| title | release_date | budget | revenue | runtime | genre1 | genres | profit | |
|---|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 193.0 | Drama | Drama, History, War | 10900000 |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | 197.0 | Drama | Drama, History | -4394751 |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 68.0 | Comedy | Comedy, Drama | 2250000 |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | 143.0 | Drama | Drama | -699800 |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 149.0 | Action | Romance, Adventure, Fantasy | 78226 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 142.0 | Action | Action, Adventure, Science Fiction | 808056021 |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | 110.0 | Fantasy | Comedy, Drama, Fantasy | -23045085 |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | NaN | Drama | Drama | 4277796 |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 135.0 | Action | Drama, Thriller, Mystery, Action | 3099072 |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 111.0 | Comedy | Comedy | 32484412 |
5815 rows × 8 columns
ratio column¶df['ratio'] = (df['revenue'] / df['budget'])
df
| title | release_date | budget | revenue | runtime | genre1 | genres | profit | ratio | |
|---|---|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 193.0 | Drama | Drama, History, War | 10900000 | 110.000000 |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | 197.0 | Drama | Drama, History | -4394751 | 0.476488 |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 68.0 | Comedy | Comedy, Drama | 2250000 | 10.000000 |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | 143.0 | Drama | Drama | -699800 | 0.363818 |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 149.0 | Action | Romance, Adventure, Fantasy | 78226 | 1.068882 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 142.0 | Action | Action, Adventure, Science Fiction | 808056021 | 4.232224 |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | 110.0 | Fantasy | Comedy, Drama, Fantasy | -23045085 | 0.757420 |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | NaN | Drama | Drama | 4277796 | 2.069449 |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 135.0 | Action | Drama, Thriller, Mystery, Action | 3099072 | 5.400003 |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 111.0 | Comedy | Comedy | 32484412 | 17.242206 |
5815 rows × 9 columns
# Round to 2 decimals
df['ratio'] = (df['revenue'] / df['budget']).round(2)
df
| title | release_date | budget | revenue | runtime | genre1 | genres | profit | ratio | |
|---|---|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 193.0 | Drama | Drama, History, War | 10900000 | 110.00 |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | 197.0 | Drama | Drama, History | -4394751 | 0.48 |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 68.0 | Comedy | Comedy, Drama | 2250000 | 10.00 |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | 143.0 | Drama | Drama | -699800 | 0.36 |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 149.0 | Action | Romance, Adventure, Fantasy | 78226 | 1.07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 142.0 | Action | Action, Adventure, Science Fiction | 808056021 | 4.23 |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | 110.0 | Fantasy | Comedy, Drama, Fantasy | -23045085 | 0.76 |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | NaN | Drama | Drama | 4277796 | 2.07 |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 135.0 | Action | Drama, Thriller, Mystery, Action | 3099072 | 5.40 |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 111.0 | Comedy | Comedy | 32484412 | 17.24 |
5815 rows × 9 columns
genre1 to genre (singular) and drop genres¶df.rename(columns = {'genre1':'genre'}, inplace = True)
df
| title | release_date | budget | revenue | runtime | genre | genres | profit | ratio | |
|---|---|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 193.0 | Drama | Drama, History, War | 10900000 | 110.00 |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | 197.0 | Drama | Drama, History | -4394751 | 0.48 |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 68.0 | Comedy | Comedy, Drama | 2250000 | 10.00 |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | 143.0 | Drama | Drama | -699800 | 0.36 |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 149.0 | Action | Romance, Adventure, Fantasy | 78226 | 1.07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 142.0 | Action | Action, Adventure, Science Fiction | 808056021 | 4.23 |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | 110.0 | Fantasy | Comedy, Drama, Fantasy | -23045085 | 0.76 |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | NaN | Drama | Drama | 4277796 | 2.07 |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 135.0 | Action | Drama, Thriller, Mystery, Action | 3099072 | 5.40 |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 111.0 | Comedy | Comedy | 32484412 | 17.24 |
5815 rows × 9 columns
df.drop(columns='genres', inplace=True)
df
| title | release_date | budget | revenue | runtime | genre | profit | ratio | |
|---|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 193.0 | Drama | 10900000 | 110.00 |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | 197.0 | Drama | -4394751 | 0.48 |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 68.0 | Comedy | 2250000 | 10.00 |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | 143.0 | Drama | -699800 | 0.36 |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 149.0 | Action | 78226 | 1.07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 142.0 | Action | 808056021 | 4.23 |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | 110.0 | Fantasy | -23045085 | 0.76 |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | NaN | Drama | 4277796 | 2.07 |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 135.0 | Action | 3099072 | 5.40 |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 111.0 | Comedy | 32484412 | 17.24 |
5815 rows × 8 columns
# Get list of current columns
list(df.columns)
['title', 'release_date', 'budget', 'revenue', 'runtime', 'genre', 'profit', 'ratio']
# Move profit and ratio before runtime
df = df[[
'title',
'release_date',
'budget',
'revenue',
'profit',
'ratio',
'runtime',
'genre']]
df
| title | release_date | budget | revenue | profit | ratio | runtime | genre | |
|---|---|---|---|---|---|---|---|---|
| 14 | The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 10900000 | 110.00 | 193.0 | Drama |
| 28 | Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | -4394751 | 0.48 | 197.0 | Drama |
| 47 | The Kid | 1921-02-06 | 250000 | 2500000 | 2250000 | 10.00 | 68.0 | Comedy |
| 57 | Foolish Wives | 1922-01-11 | 1100000 | 400200 | -699800 | 0.36 | 143.0 | Drama |
| 78 | The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 78226 | 1.07 | 149.0 | Action |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26065 | Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 808056021 | 4.23 | 142.0 | Action |
| 26070 | Cats | 2019-12-19 | 95000000 | 71954915 | -23045085 | 0.76 | 110.0 | Fantasy |
| 26074 | The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | 4277796 | 2.07 | NaN | Drama |
| 26092 | Driving Licence | 2019-12-20 | 704334 | 3803406 | 3099072 | 5.40 | 135.0 | Action |
| 26121 | My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 32484412 | 17.24 | 111.0 | Comedy |
5815 rows × 8 columns
title as the dataframe index¶df.set_index('title', inplace=True)
df
| release_date | budget | revenue | profit | ratio | runtime | genre | |
|---|---|---|---|---|---|---|---|
| title | |||||||
| The Birth of a Nation | 1915-02-08 | 100000 | 11000000 | 10900000 | 110.00 | 193.0 | Drama |
| Intolerance: Love's Struggle Throughout the Ages | 1916-09-04 | 8394751 | 4000000 | -4394751 | 0.48 | 197.0 | Drama |
| The Kid | 1921-02-06 | 250000 | 2500000 | 2250000 | 10.00 | 68.0 | Comedy |
| Foolish Wives | 1922-01-11 | 1100000 | 400200 | -699800 | 0.36 | 143.0 | Drama |
| The Thief of Bagdad | 1924-03-18 | 1135654 | 1213880 | 78226 | 1.07 | 149.0 | Action |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Star Wars: The Rise of Skywalker | 2019-12-18 | 250000000 | 1058056021 | 808056021 | 4.23 | 142.0 | Action |
| Cats | 2019-12-19 | 95000000 | 71954915 | -23045085 | 0.76 | 110.0 | Fantasy |
| The Fortune Goddess | 2019-12-19 | 4000000 | 8277796 | 4277796 | 2.07 | NaN | Drama |
| Driving Licence | 2019-12-20 | 704334 | 3803406 | 3099072 | 5.40 | 135.0 | Action |
| My Mom is a Character 3 | 2019-12-26 | 2000000 | 34484412 | 32484412 | 17.24 | 111.0 | Comedy |
5815 rows × 7 columns
profit¶# dataprep.eda stats and plots
plot(df, 'profit')
| Distinct Count | 5618 |
|---|---|
| Unique (%) | 96.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 456.2 KB |
| Mean | 6.3497e+07 |
| Minimum | -1.6571e+08 |
| Maximum | 2.551e+09 |
| Zeros | 7 |
| Zeros (%) | 0.1% |
| Negatives | 1665 |
| Negatives (%) | 28.6% |
| Minimum | -1.6571e+08 |
|---|---|
| 5-th Percentile | -2.0786e+07 |
| Q1 | -1.2676e+06 |
| Median | 1.29e+07 |
| Q3 | 6.5642e+07 |
| 95-th Percentile | 3.0623e+08 |
| Maximum | 2.551e+09 |
| Range | 2.7167e+09 |
| IQR | 6.6909e+07 |
| Mean | 6.3497e+07 |
|---|---|
| Standard Deviation | 1.5245e+08 |
| Variance | 2.3241e+16 |
| Sum | 3.6924e+11 |
| Skewness | 5.2512 |
| Kurtosis | 44.6967 |
| Coefficient of Variation | 2.4009 |
# Get the statistical summary
df['profit'].describe().map('{:,.0f}'.format)
count 5,815 mean 63,497,376 std 152,448,864 min -165,710,090 25% -1,267,568 50% 12,900,000 75% 65,641,752 max 2,550,965,087 Name: profit, dtype: object
ratio¶# dataprep.eda stats and plots
plot(df, 'ratio')
| Distinct Count | 1300 |
|---|---|
| Unique (%) | 22.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 456.2 KB |
| Mean | 437.9972 |
| Minimum | 0 |
| Maximum | 1.2192e+06 |
| Zeros | 60 |
| Zeros (%) | 1.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0.09 |
| Q1 | 0.84 |
| Median | 2.1 |
| Q3 | 4.345 |
| 95-th Percentile | 15.639 |
| Maximum | 1.2192e+06 |
| Range | 1.2192e+06 |
| IQR | 3.505 |
| Mean | 437.9972 |
|---|---|
| Standard Deviation | 20917.8294 |
| Variance | 4.3756e+08 |
| Sum | 2.547e+06 |
| Skewness | 53.0475 |
| Kurtosis | 2879.1011 |
| Coefficient of Variation | 47.7579 |
# matplotlib horizontal boxplot resized
df['ratio'].plot.box(vert=False, figsize=(12,5));
# Define a list of desired fields
cols = ['budget','revenue','profit','ratio']
# View the top 20 movies sorted by ratio
df[cols].sort_values('ratio', ascending=False).head(20)
| budget | revenue | profit | ratio | |
|---|---|---|---|---|
| title | ||||
| Maurice | 2 | 2438304 | 2438302 | 1219152.00 |
| Nurse 3-D | 10 | 10000000 | 9999990 | 1000000.00 |
| superlorpez | 52 | 12496155 | 12496103 | 240310.67 |
| From Prada to Nada | 93 | 2500000 | 2499907 | 26881.72 |
| Paranormal Activity | 15000 | 193355800 | 193340800 | 12890.39 |
| One Cut of the Dead | 30000 | 200000000 | 199970000 | 6666.67 |
| Tarnation | 218 | 1162014 | 1161796 | 5330.34 |
| The Blair Witch Project | 60000 | 248000000 | 247940000 | 4133.33 |
| Love, Wedding, Marriage | 1 | 1378 | 1377 | 1378.00 |
| Pink Flamingos | 12000 | 6000000 | 5988000 | 500.00 |
| Secret Superstar | 286284 | 137416709 | 137130425 | 480.00 |
| Super Size Me | 65000 | 28575078 | 28510078 | 439.62 |
| The Gallows | 100000 | 42664410 | 42564410 | 426.64 |
| Open Water | 130000 | 54667954 | 54537954 | 420.52 |
| The Texas Chain Saw Massacre | 85000 | 30859000 | 30774000 | 363.05 |
| Bambi | 858000 | 267447150 | 266589150 | 311.71 |
| Night of the Living Dead | 114000 | 30000000 | 29886000 | 263.16 |
| The Mistress | 22361 | 5876918 | 5854557 | 262.82 |
| Mad Max | 400000 | 100000000 | 99600000 | 250.00 |
| The Legend of Boggy Creek | 100000 | 22000000 | 21900000 | 220.00 |
# Format the numbers for easier reading
cols = ['budget','revenue','profit','ratio']
# Wrap in parens to keep multiple lines together
(
df[cols].sort_values('ratio', ascending=False)
.head(20)
.apply(lambda s: s.apply('{:,.0f}'.format))
)
| budget | revenue | profit | ratio | |
|---|---|---|---|---|
| title | ||||
| Maurice | 2 | 2,438,304 | 2,438,302 | 1,219,152 |
| Nurse 3-D | 10 | 10,000,000 | 9,999,990 | 1,000,000 |
| superlorpez | 52 | 12,496,155 | 12,496,103 | 240,311 |
| From Prada to Nada | 93 | 2,500,000 | 2,499,907 | 26,882 |
| Paranormal Activity | 15,000 | 193,355,800 | 193,340,800 | 12,890 |
| One Cut of the Dead | 30,000 | 200,000,000 | 199,970,000 | 6,667 |
| Tarnation | 218 | 1,162,014 | 1,161,796 | 5,330 |
| The Blair Witch Project | 60,000 | 248,000,000 | 247,940,000 | 4,133 |
| Love, Wedding, Marriage | 1 | 1,378 | 1,377 | 1,378 |
| Pink Flamingos | 12,000 | 6,000,000 | 5,988,000 | 500 |
| Secret Superstar | 286,284 | 137,416,709 | 137,130,425 | 480 |
| Super Size Me | 65,000 | 28,575,078 | 28,510,078 | 440 |
| The Gallows | 100,000 | 42,664,410 | 42,564,410 | 427 |
| Open Water | 130,000 | 54,667,954 | 54,537,954 | 421 |
| The Texas Chain Saw Massacre | 85,000 | 30,859,000 | 30,774,000 | 363 |
| Bambi | 858,000 | 267,447,150 | 266,589,150 | 312 |
| Night of the Living Dead | 114,000 | 30,000,000 | 29,886,000 | 263 |
| The Mistress | 22,361 | 5,876,918 | 5,854,557 | 263 |
| Mad Max | 400,000 | 100,000,000 | 99,600,000 | 250 |
| The Legend of Boggy Creek | 100,000 | 22,000,000 | 21,900,000 | 220 |
Evaluation
Filter for movies with budgets GTE $50K
# Filter for movies with budgets GTE $50K
df_budg50k = df[df['budget'] >= 50000]
df_budg50k.sort_values('budget')
| release_date | budget | revenue | profit | ratio | runtime | genre | |
|---|---|---|---|---|---|---|---|
| title | |||||||
| K-12 | 2019-09-05 | 50000 | 359377 | 309377 | 7.19 | 92.0 | Horror |
| The Isle | 2000-04-22 | 50000 | 21075 | -28925 | 0.42 | 90.0 | Crime/Mystery/Thriller |
| The Signal | 2008-02-22 | 50000 | 249905 | 199905 | 5.00 | 103.0 | Horror |
| Pi | 1998-07-10 | 60000 | 3221152 | 3161152 | 53.69 | 84.0 | Crime/Mystery/Thriller |
| Stockholm | 2013-04-25 | 60000 | 50000 | -10000 | 0.83 | 90.0 | Romance |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Justice League | 2017-11-15 | 300000000 | 657924295 | 357924295 | 2.19 | 120.0 | Action |
| Avengers: Infinity War | 2018-04-25 | 300000000 | 2046239637 | 1746239637 | 6.82 | 149.0 | Action |
| Pirates of the Caribbean: At World's End | 2007-05-19 | 300000000 | 961000000 | 661000000 | 3.20 | 169.0 | Action |
| Avengers: Endgame | 2019-04-24 | 356000000 | 2797800564 | 2441800564 | 7.86 | 181.0 | Action |
| Pirates of the Caribbean: On Stranger Tides | 2011-05-14 | 380000000 | 1045713802 | 665713802 | 2.75 | 136.0 | Action |
5753 rows × 7 columns
cols = ['budget','revenue','profit','ratio']
# Wrap in parens to keep multiple lines together
(
df_budg50k[cols].sort_values('ratio', ascending=False)
.head(20)
.apply(lambda s: s.apply('{:,.0f}'.format))
)
| budget | revenue | profit | ratio | |
|---|---|---|---|---|
| title | ||||
| The Blair Witch Project | 60,000 | 248,000,000 | 247,940,000 | 4,133 |
| Secret Superstar | 286,284 | 137,416,709 | 137,130,425 | 480 |
| Super Size Me | 65,000 | 28,575,078 | 28,510,078 | 440 |
| The Gallows | 100,000 | 42,664,410 | 42,564,410 | 427 |
| Open Water | 130,000 | 54,667,954 | 54,537,954 | 421 |
| The Texas Chain Saw Massacre | 85,000 | 30,859,000 | 30,774,000 | 363 |
| Bambi | 858,000 | 267,447,150 | 266,589,150 | 312 |
| Night of the Living Dead | 114,000 | 30,000,000 | 29,886,000 | 263 |
| Mad Max | 400,000 | 100,000,000 | 99,600,000 | 250 |
| The Legend of Boggy Creek | 100,000 | 22,000,000 | 21,900,000 | 220 |
| The Way of the Dragon | 130,000 | 27,000,000 | 26,870,000 | 208 |
| Alice in Wonderland | 3,000,000 | 572,000,000 | 569,000,000 | 191 |
| American Graffiti | 777,000 | 140,000,000 | 139,223,000 | 180 |
| Let's Do It Again | 70,000 | 11,800,000 | 11,730,000 | 169 |
| Once | 160,000 | 20,710,513 | 20,550,513 | 129 |
| Snow White and the Seven Dwarfs | 1,488,423 | 184,925,486 | 183,437,063 | 124 |
| Fritz the Cat | 850,000 | 100,000,000 | 99,150,000 | 118 |
| Rocky | 1,000,000 | 117,235,147 | 116,235,147 | 117 |
| Napoleon Dynamite | 400,000 | 46,118,097 | 45,718,097 | 115 |
| The Birth of a Nation | 100,000 | 11,000,000 | 10,900,000 | 110 |
Pivot table and charts to show per genre:
# Test a pivot
cols = ['budget','revenue','profit','ratio']
df_budg50k.pivot_table(values=cols, index='genre', aggfunc='mean')
| budget | profit | ratio | revenue | |
|---|---|---|---|---|
| genre | ||||
| Action | 5.094985e+07 | 9.963509e+07 | 3.813278 | 1.505849e+08 |
| Animation | 6.408742e+07 | 1.680774e+08 | 6.793761 | 2.321648e+08 |
| Comedy | 2.000483e+07 | 4.245366e+07 | 4.840151 | 6.245849e+07 |
| Crime/Mystery/Thriller | 2.291217e+07 | 3.223725e+07 | 3.414494 | 5.514942e+07 |
| Documentary | 4.998760e+06 | 1.881403e+07 | 16.631200 | 2.381279e+07 |
| Drama | 1.775218e+07 | 3.064621e+07 | 5.187140 | 4.839839e+07 |
| Family | 4.668950e+07 | 8.640023e+07 | 3.572482 | 1.330897e+08 |
| Fantasy | 3.215551e+07 | 5.802558e+07 | 3.218211 | 9.018109e+07 |
| Horror | 1.695193e+07 | 3.706127e+07 | 15.138917 | 5.401320e+07 |
| Other | 9.733333e+06 | 1.168017e+07 | 6.606667 | 2.141351e+07 |
| Romance | 1.742700e+07 | 3.391465e+07 | 5.344539 | 5.134164e+07 |
| Science Fiction | 3.103141e+07 | 3.426579e+07 | 2.510288 | 6.529720e+07 |
# Store as variable with organized columns
cols = ['budget','revenue','profit','ratio']
Genre_AVGs = df_budg50k.pivot_table(values=cols, index='genre', aggfunc='mean')
Genre_AVGs = Genre_AVGs[['budget','revenue','profit','ratio']]
Genre_AVGs
| budget | revenue | profit | ratio | |
|---|---|---|---|---|
| genre | ||||
| Action | 5.094985e+07 | 1.505849e+08 | 9.963509e+07 | 3.813278 |
| Animation | 6.408742e+07 | 2.321648e+08 | 1.680774e+08 | 6.793761 |
| Comedy | 2.000483e+07 | 6.245849e+07 | 4.245366e+07 | 4.840151 |
| Crime/Mystery/Thriller | 2.291217e+07 | 5.514942e+07 | 3.223725e+07 | 3.414494 |
| Documentary | 4.998760e+06 | 2.381279e+07 | 1.881403e+07 | 16.631200 |
| Drama | 1.775218e+07 | 4.839839e+07 | 3.064621e+07 | 5.187140 |
| Family | 4.668950e+07 | 1.330897e+08 | 8.640023e+07 | 3.572482 |
| Fantasy | 3.215551e+07 | 9.018109e+07 | 5.802558e+07 | 3.218211 |
| Horror | 1.695193e+07 | 5.401320e+07 | 3.706127e+07 | 15.138917 |
| Other | 9.733333e+06 | 2.141351e+07 | 1.168017e+07 | 6.606667 |
| Romance | 1.742700e+07 | 5.134164e+07 | 3.391465e+07 | 5.344539 |
| Science Fiction | 3.103141e+07 | 6.529720e+07 | 3.426579e+07 | 2.510288 |
Genre_AVGs.sort_values('ratio', ascending=False).apply(lambda s: s.apply('{:,.0f}'.format))
| budget | revenue | profit | ratio | |
|---|---|---|---|---|
| genre | ||||
| Documentary | 4,998,760 | 23,812,789 | 18,814,029 | 17 |
| Horror | 16,951,929 | 54,013,200 | 37,061,271 | 15 |
| Animation | 64,087,424 | 232,164,807 | 168,077,383 | 7 |
| Other | 9,733,333 | 21,413,508 | 11,680,175 | 7 |
| Romance | 17,426,995 | 51,341,641 | 33,914,645 | 5 |
| Drama | 17,752,180 | 48,398,387 | 30,646,207 | 5 |
| Comedy | 20,004,834 | 62,458,494 | 42,453,660 | 5 |
| Action | 50,949,847 | 150,584,937 | 99,635,091 | 4 |
| Family | 46,689,497 | 133,089,723 | 86,400,226 | 4 |
| Crime/Mystery/Thriller | 22,912,165 | 55,149,419 | 32,237,254 | 3 |
| Fantasy | 32,155,510 | 90,181,092 | 58,025,582 | 3 |
| Science Fiction | 31,031,410 | 65,297,200 | 34,265,790 | 3 |
# Matplotlib to get a horizontal bar chart, including null values
# Multi-line for easier reading
(
Genre_AVGs['budget'].sort_values(ascending=False)
.plot.barh(title = 'AVG Budget by Genre', figsize=(7,5))
.invert_yaxis() # Fix sort order for horz bar chart
)
# Matplotlib to get a horizontal bar chart, including null values
# Multi-line for easier reading
(
Genre_AVGs['revenue'].sort_values(ascending=False)
.plot.barh(title = 'AVG Revenue by Genre', figsize=(7,5))
.invert_yaxis() # Fix sort order for horz bar chart
)
(
Genre_AVGs['profit'].sort_values(ascending=False)
.plot.barh(title = 'AVG Profit by Genre', figsize=(7,5))
.invert_yaxis() # Fix sort order for horz bar chart
)
(
Genre_AVGs['ratio'].sort_values(ascending=False)
.plot.barh(title = 'AVG Ratio by Genre', figsize=(7,5))
.invert_yaxis() # Fix sort order for horz bar chart
)